Abstract

In this paper, we consider the restless bandit problem, one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit problem in decision theory. The problem is known to be PSPACE-hard to approximate to any non-trivial factor, so an optimal policy is very difficult to obtain. A natural alternative is the greedy policy, which is attractive for its stability and simplicity. However, the greedy policy generally incurs an optimality loss because of its intrinsically myopic behavior. In this paper, by analyzing a class of so-called standard reward functions, we establish a closed-form condition on the discount factor $\beta$ under which the greedy policy is guaranteed to be optimal under the discounted expected reward criterion; in particular, when the condition admits $\beta = 1$, the greedy policy is optimal under the average accumulative reward criterion. Thus, the standard form of the reward function can be used to judge the optimality of the greedy policy without any complicated calculation. Some examples in cognitive radio networks are presented to verify the effectiveness of the mathematical result in judging the optimality of the greedy policy.

Index Terms

Partially observed Markov decision process (POMDP), multi-armed restless bandit problems, optimality, greedy policy, cognitive radio

I. Introduction 

We consider a system consisting of $n$ uncontrolled Markov chains evolving independently in discrete time. Each chain is an independent and identically distributed (i.i.d.) two-state Markov process. The two states are denoted as the "good" state (state 1) and the "bad" state (state 0). The transition probabilities are $p_{ij}$, $i, j = 0, 1$. At each time instant, a user is allowed to select $k$ out of the $n$ processes according to its strategy and to observe their states (assuming perfect observation), while the processes not selected continue to evolve according to their transition rules. The user obtains a reward determined by the combination of the observed states of the $k$ selected processes; for example, it collects no reward if all $k$ processes are observed to be "bad". This cycle of selecting, observing, and collecting repeats until the user stops accessing the system. This is a multi-armed bandit (MAB) problem [1] as well as a partially observed Markov decision process (POMDP) problem, which has been used and studied in [?], [2]. Unfortunately, obtaining optimal solutions to a general restless bandit process is PSPACE-hard, and analytical characterizations of the performance of the optimal policy are often intractable. Hence the greedy policy governing the channel selection is a suitable choice, because it focuses only on maximizing the immediate reward and ignores its effect on the future reward. However, the greedy policy is not optimal in general.

Two main research directions have recently emerged that address the greedy policy for this kind of MAB problem. The first seeks constant-factor approximation algorithms, such as the 68-approximation [4] developed via linear programming relaxation under the condition $p_{11} > 0.5 > p_{01}$ for each arm, and the 2-approximation policy for a class of monotone restless bandit problems [5]. A relevant application in dynamic multichannel access is [6], where the authors established indexability and obtained the Whittle index in closed form for both the discounted and the average reward criteria. The other direction explores conditions under which the greedy policy is optimal for a concrete application or scenario. Our work follows this line. Although many works have studied this problem, the immediate reward function in those works is restricted to a linear combination of the observed states. For instance, in [7] the optimality of the greedy policy was proved for choosing $k = 1$ channel in the case of positively correlated channels, and this was then extended to arbitrary $k$ channels in [8]. In our previous work [9], we extended the work in [7] along another line, to the scenario where the immediate reward function is the simplest non-linear combination of the observed states, and proved that the greedy policy is not optimal in general. This contrasts with the result of [8], where the immediate reward function is a linear combination of the observed states. These contrasting conclusions make it necessary to study the effect of the immediate reward function on the optimality of the greedy policy, which is one of the major motivations for this paper.
From a technical perspective, optimality of the greedy policy requires that the user prefer exploitation to exploration. One of the simplest ways to implement this mechanism is to adjust the balance between exploitation and exploration through the discount factor $\beta$. On the other hand, given the different conclusions that result from subtle differences in the immediate reward function [8], [9], we focus on one generic and basic class of immediate reward functions, formed by combinations of variables of order 1 and referred to as standard reward functions. Our objective is therefore to derive a sufficient condition on the discount factor under which the greedy policy is guaranteed to be optimal for standard reward functions under the discounted accumulative reward criterion. If the condition admits $\beta = 1$, optimality of the greedy policy for the discounted accumulative reward carries over to optimality for the average expected reward over the time horizon of interest. We can therefore judge the optimality of the greedy policy for both the discounted accumulative and the average expected reward from the closed-form condition on $\beta$. To the best of our knowledge, very few results have been reported from this perspective.

Compared with existing works on the optimality of the greedy policy in the MAB problem, our contribution is three-fold:

• We analyze one special class of MAB problems in which the immediate reward function is a so-called standard one, and show that the discounted accumulative reward function is then also a standard reward function. Furthermore, we establish the optimality of the greedy policy under the discounted accumulative reward criterion when $p_{11} > p_{01}$. The theoretical results demonstrate that the greedy policy choosing the best 1 or $N - 1$ out of $N$ channels is optimal when $0 \le \beta \le 1$. For the case of choosing $k$ ($1 < k < N - 1$) channels, the greedy policy is optimal only when the discount factor satisfies a simple closed-form condition.

• The major technique developed in this paper is largely based on the analytic properties of the standard reward function, and is completely different from [7], [8], which rely on a coupling argument. Besides its significant and practical application in cognitive radio networks, this technique serves as the key criterion for judging the optimality of the greedy policy in other scenarios where the immediate reward function is a combination of standard functions.

• We analyze two practical models in cognitive radio networks. The first model involves the sensing order problem, where the secondary user selects $k$ ($1 \le k \le N$) of $N$ channels in order to maximize the probability of finding an idle channel. Here the immediate reward function is an order-1 non-linear combination of the availability probabilities of the selected channels. The result demonstrates that the greedy policy is not optimal in general under the average expected reward, which is consistent with [9]. In the second model, a user chooses $k$ ($1 \le k \le N$) channels to access and receives a reward on each channel in the good state. The immediate reward function is a linear combination of the availabilities of the selected channels. Our derived result is consistent with that in [7], [8], where the myopic policy choosing any number of channels is optimal.
The rest of the paper is organized as follows: our model is formulated in Section II. Section III analyzes the standard reward function. Section IV gives the optimality theorem of the myopic policy. Three applications are given in Section V. Finally, our conclusions are summarized in Section VI.

II. Problem Formulation

As outlined in the introduction, we consider a user trying to access a system consisting of $n$ independent and statistically identical channels, each modeled as a two-state Markov chain. The set of $n$ channels is denoted by $\mathcal{N}$, with channels indexed by $i = 1, 2, \ldots, n$, and the state of channel $i$ is denoted by $S_i(t) \in \{1\ (\text{good}),\ 0\ (\text{bad})\}$. The system operates in discrete time steps indexed by $t$ ($t = 1, 2, \ldots, T$), where $T$ is the time horizon of interest (the time at which the user gives up accessing the system). Specifically, we assume that channels undergo state transitions at the beginning of slot $t$ and that the user then makes the channel selection decision at time $t$. Limited by hardware or sensing policy, at time $t$ the user is allowed to choose $k$ ($1 \le k \le n$) of the $n$ channels to sense; the chosen channel set is denoted by $a^k(t) \subset \mathcal{N}$, with $|a^k(t)| = k$.

Obviously, the user cannot observe the full state $S(t) \in \{0, 1\}^n$ of the underlying system (i.e., the states of all $n$ channels). A sufficient statistic of such a system for optimal decision making, or the information state of the system, is given by the conditional probabilities of the state each channel is in given all past actions and observations [?]. We denote this information state (also called the belief vector) by $\Omega(t) = [\omega_1(t), \ldots, \omega_n(t)] \in [0, 1]^n$, where $\omega_i(t)$ is the conditional probability that channel $i$ is in state 1 at time $t$ given all past states, actions and observations. In the rest of the paper, $\omega_i(t)$ is referred to as the information state of channel $i$ at time $t$, or simply the channel probability of $i$ at time $t$. Due to the Markovian nature of the channel model, the future information state is a function only of the current information state and the current action, i.e., it is independent of the past history given the current information state and action. Given that the information state at time $t$ is $\Omega(t) = \{\omega_i(t), i \in \mathcal{N}\}$ and the sensing action $a^k(t) \subset \mathcal{N}$ is taken, the information state at time $t+1$ can be updated using Bayes' rule as shown in (1).

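$$
\omega_i(t+1) =
\begin{cases}
p_{11}, & i \in a^k(t),\ S_i(t) = 1 \\
p_{01}, & i \in a^k(t),\ S_i(t) = 0 \\
\tau(\omega_i(t)), & i \notin a^k(t)
\end{cases}
\qquad (1)
$$

where $\tau(\omega_i(t)) = \omega_i(t)\, p_{11} + [1 - \omega_i(t)]\, p_{01}$.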
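The update rule (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function names `tau` and `update_belief` and the argument layout are assumptions made here for the example.

```python
def tau(omega, p11, p01):
    """One-step belief propagation for a channel that was not sensed."""
    return omega * p11 + (1.0 - omega) * p01


def update_belief(beliefs, sensed, observed, p11, p01):
    """Return the belief vector for slot t+1 following rule (1).

    beliefs  -- list of omega_i(t), one entry per channel
    sensed   -- set of channel indices chosen in slot t (hypothetical encoding)
    observed -- dict {channel index: 0 or 1} with the states seen on sensed channels
    """
    nxt = []
    for i, w in enumerate(beliefs):
        if i in sensed:
            # A sensed channel is known exactly, so its next belief is a transition probability.
            nxt.append(p11 if observed[i] == 1 else p01)
        else:
            # An unsensed channel only has its belief propagated through the chain.
            nxt.append(tau(w, p11, p01))
    return nxt
```

For instance, with `p11 = 0.8` and `p01 = 0.2`, an unsensed channel with belief 0.3 moves to `0.3 * 0.8 + 0.7 * 0.2`, while a channel sensed in the good state jumps to 0.8 regardless of its previous belief.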

The objective is to maximize the discounted accumulative reward over a finite horizon, given by the following problem:

$$
\max_{\pi} E^{\pi}\Big[\sum_{t=1}^{T} \beta^{t-1} R_{\pi_t}(\Omega(t)) \,\Big|\, \Omega(1)\Big] \qquad (2)
$$

where $R_{\pi_t}(\Omega(t))$ is the reward collected under state $\Omega(t)$ when the channels in the set $a^k(t) = \pi_t(\Omega(t))$ are selected, and $\pi_t$ specifies a mapping from the current information state $\Omega(t)$ to a channel selection action $a^k(t) = \pi_t(\Omega(t)) \subset \mathcal{N}$.

Let $V_t(\Omega)$ be the value function, which represents the maximum expected discounted accumulative reward obtained from $t$ to $T$ given the initial belief vector $\Omega$. Let $p_{01}[x]$ and $p_{11}[x]$ denote the vectors $[p_{01}, \ldots, p_{01}]$ and $[p_{11}, \ldots, p_{11}]$ of length $x$. Thus, we arrive at the following optimality equations:

$$
V_T(\Omega(T)) = \max_{a^k(T)} E[R(\Omega(T))] = \max_{a^k(T)} F(\Omega(T)) \qquad (3)
$$

$$
V_t(\Omega(t)) = \max_{a^k(t)} \big[F(\Omega(t)) + \beta K_t(\Omega(t))\big] \qquad (4)
$$

$$
K_t(\Omega(t)) = \sum_{e \in \mathcal{P}(a^k(t))} \prod_{i \in e} \omega_i \prod_{j \in a^k(t) \setminus e} (1 - \omega_j)\, V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}(t)), \ldots, \tau(\omega_n(t)), p_{01}[k - |e|]\big) \qquad (5)
$$

where $\mathcal{P}(a^k(t))$ represents the power set generated by the set $a^k(t)$, the expected immediate reward is $F : [0,1]^n \to \mathbb{R}$, and $|e|$ is the cardinality of the set $e$. On the right-hand side of (4), the reward that can be collected from slot $t$ onward consists of two parts: the expected immediate reward $F(\Omega(t))$ and the future discounted accumulative reward $\beta K_t(\Omega(t))$, calculated by summing over all possible realizations of the $k$ selected channels. In (5), the channel state probability vector consists of three parts: a sequence of $p_{11}$'s indicating those channels sensed to be in state 1 at time $t$; a sequence of values $\tau(\omega_j)$ for all $j \notin a^k$; and a sequence of $p_{01}$'s indicating those channels sensed to be in state 0 at time $t$.
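To make the recursion (3)-(5) concrete, the following brute-force sketch evaluates the value function on a tiny instance by enumerating every $k$-subset and every observation outcome. It is only an illustration under stated assumptions: the names `value`, `next_beliefs` and `linear_reward` are hypothetical, channels keep their own indices instead of being re-sorted as in (5), and the reward argument is the expected immediate reward $F$ evaluated on the selected beliefs. The cost grows exponentially in $T$, so it is meant for very small $n$, $k$ and $T$ only.

```python
from itertools import combinations, product


def linear_reward(selected_beliefs):
    # Example of a standard reward: one unit per channel sensed good.
    return sum(selected_beliefs)


def next_beliefs(beliefs, action, outcome, p11, p01):
    # outcome[j] is the observed state (0 or 1) of channel action[j].
    seen = dict(zip(action, outcome))
    return tuple(
        (p11 if seen[i] else p01) if i in seen else w * p11 + (1 - w) * p01
        for i, w in enumerate(beliefs)
    )


def value(beliefs, t, T, k, beta, p11, p01, reward, greedy=False):
    """Expected discounted reward from slot t to T, by exhaustive recursion.

    With greedy=False this is the optimal value of (3)-(4);
    with greedy=True only the action with the largest immediate reward is followed.
    """
    n = len(beliefs)
    actions = list(combinations(range(n), k))
    if greedy:
        actions = [max(actions, key=lambda a: reward([beliefs[i] for i in a]))]
    best = float("-inf")
    for a in actions:
        total = reward([beliefs[i] for i in a])
        if t < T:
            # Sum over all realizations of the k sensed channels, as in (5).
            for outcome in product((0, 1), repeat=k):
                prob = 1.0
                for i, s in zip(a, outcome):
                    prob *= beliefs[i] if s else 1.0 - beliefs[i]
                if prob > 0.0:
                    nb = next_beliefs(beliefs, a, outcome, p11, p01)
                    total += beta * prob * value(
                        nb, t + 1, T, k, beta, p11, p01, reward, greedy)
        best = max(best, total)
    return best
```

Comparing `value(..., greedy=False)` with `value(..., greedy=True)` on a small instance gives a direct numerical check of whether the greedy policy loses anything for a chosen reward function and discount factor.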

Considering the computational complexity of the recursive structure (4), we should seek policies other than the optimal one. One of the simplest approaches is the greedy policy, where at each time step the objective is to maximize the expected immediate reward $F(\Omega(t))$. Thus, the greedy policy is given as follows:

$$
\hat{a}^k(t) = \arg\max_{a^k(t) \subset \mathcal{N}} F(\Omega(t)) \qquad (6)
$$

Note that in the rest of the paper we always assume that the greedy action $\hat{a}^k(t)$ is the optimal action at slot $t$, and then derive a sufficient condition on $\beta$ that guarantees the optimality of the greedy policy. Where no ambiguity arises, $\hat{a}^k(t)$ and $a^k(t)$ are used interchangeably in the rest of the paper.
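As an illustration of the greedy rule (6), the sketch below picks the subset with the largest expected immediate reward and runs one sample path of the resulting policy on simulated channels. The helper names (`greedy_action`, `simulate_greedy`), the separation between the expected reward `reward` and the collected reward `realized`, and the way hidden states are drawn are all assumptions of this example rather than part of the paper.

```python
import random
from itertools import combinations


def greedy_action(beliefs, k, reward):
    """Return the k-subset of channel indices maximising the immediate reward."""
    return max(combinations(range(len(beliefs)), k),
               key=lambda a: reward([beliefs[i] for i in a]))


def simulate_greedy(beliefs, k, T, beta, p11, p01, reward, realized, rng):
    """Run one sample path of the greedy policy; return the discounted reward.

    reward   -- expected immediate reward F as a function of the selected beliefs
    realized -- reward actually collected given the observed 0/1 states
    rng      -- a random.Random instance (passed in so runs are reproducible)
    """
    # Draw hidden initial states consistent with the initial beliefs.
    states = [1 if rng.random() < w else 0 for w in beliefs]
    total = 0.0
    for t in range(T):
        action = greedy_action(beliefs, k, reward)
        obs = {i: states[i] for i in action}
        total += (beta ** t) * realized(list(obs.values()))
        # Belief update following rule (1).
        beliefs = [(p11 if obs[i] else p01) if i in obs
                   else w * p11 + (1 - w) * p01
                   for i, w in enumerate(beliefs)]
        # Hidden Markov transition of every channel before the next slot.
        states = [1 if rng.random() < (p11 if s else p01) else 0 for s in states]
    return total
```

A call such as `simulate_greedy([0.7, 0.5, 0.3], 2, 5, 0.9, 0.8, 0.2, sum, sum, random.Random(0))` uses the linear reward for both the expected and the realized reward; averaging many such paths estimates the greedy policy's discounted reward.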

III. Standard Reward Function 
A. Feature of Immediate Reward Function 

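For simplicity, we assume that $\omega_1(t) \ge \omega_2(t) \ge \cdots \ge \omega_n(t)$, and then use $a^k(t) = \{1, \ldots, k\}$ and $a^k(t) = \{\omega_1(t), \ldots, \omega_k(t)\}$ interchangeably. The immediate reward $F(\Omega(t)) = F(\omega_1(t), \ldots, \omega_k(t), \ldots, \omega_n(t)) = F(\omega_1(t), \ldots, \omega_k(t))$ corresponds to choosing the first $k$ channels. In particular, we drop the time slot index of $\omega_i(t)$ and use $\omega_i(t)$ and $\omega_i$ interchangeably where no ambiguity arises.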

Three fundamental yet natural assumptions about the immediate reward function are listed as follows:

Assumption 1. (symmetry) The immediate reward function $F(\Omega(t))$ is symmetric about any two different channels in $a^k(t)$, that is, for $i, j \in a^k(t)$,

$$
F(\omega_1(t), \ldots, \omega_i(t), \ldots, \omega_j(t), \ldots, \omega_n(t)) = F(\omega_1(t), \ldots, \omega_j(t), \ldots, \omega_i(t), \ldots, \omega_n(t)), \quad 1 \le i \ne j \le k \qquad (7)
$$

Assumption 2. (affine) The immediate reward function $F(\Omega(t))$ is an order-1 polynomial of $\omega_i(t)$, $1 \le i \le n$ (that is, $F(\Omega(t))$ is affine in each variable when all other variables are held constant):

$$
F(\omega_1(t), \ldots, \omega_{i-1}(t), \omega_i(t), \omega_{i+1}(t), \ldots, \omega_n(t)) = \omega_i(t)\, F(\omega_1(t), \ldots, \omega_{i-1}(t), 1, \omega_{i+1}(t), \ldots, \omega_n(t)) + (1 - \omega_i(t))\, F(\omega_1(t), \ldots, \omega_{i-1}(t), 0, \omega_{i+1}(t), \ldots, \omega_n(t)) \qquad (8)
$$

Assumption 3. (monotonicity) The immediate reward function $F(\Omega(t))$ increases monotonically with $\omega_i(t)$, $1 \le i \le k$, that is,

$$
\omega_i'(t) \ge \omega_i(t) \implies F(\omega_1(t), \ldots, \omega_i'(t), \ldots, \omega_n(t)) \ge F(\omega_1(t), \ldots, \omega_i(t), \ldots, \omega_n(t)) \qquad (9)
$$

Note that these assumptions are necessary and non-redundant. Moreover, the three assumptions together define a general class of functions, referred to as standard immediate reward functions.

Definition 1. A reward function is a standard one if it satisfies the aforementioned three assumptions.
In order to see the intrinsic structure of the standard immediate reward function, we give three basic examples.

Example 1. Consider the scenario in [8] where the user gets one unit of reward for each channel sensed good. In this example, the expected slot reward function is $F(\Omega) = \sum_{i=1}^{k} \omega_i$. It is easily verified that $F$ satisfies the above three assumptions and thus is standard.

Example 2. Consider the scenario where the user gets one unit of reward only if all the channels are sensed to be good. Thus the immediate reward is $F(\Omega) = \prod_{i=1}^{k} \omega_i$, which is a standard one.

Example 3. Consider the scenario in [9] where the user gets one unit of reward if at least one channel is sensed good. In this case, the expected slot reward function is $F(\Omega) = 1 - \prod_{i=1}^{k} (1 - \omega_i)$, which is standard since it satisfies the three assumptions.
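The three examples are easy to probe numerically. The sketch below writes each reward as a function of the selected beliefs and checks the affine property (8) at a given point; the function names and the helper `is_affine_in` are illustrative assumptions, and a pointwise check is of course not a proof.

```python
from math import prod


def reward_sum(w):
    # Example 1: one unit of reward per channel sensed good.
    return sum(w)


def reward_all(w):
    # Example 2: reward only if every sensed channel is good.
    return prod(w)


def reward_any(w):
    # Example 3: reward if at least one sensed channel is good.
    return 1.0 - prod(1.0 - x for x in w)


def is_affine_in(F, w, i, tol=1e-12):
    """Check identity (8) for coordinate i at the point w."""
    hi = list(w)
    hi[i] = 1.0
    lo = list(w)
    lo[i] = 0.0
    return abs(F(w) - (w[i] * F(hi) + (1.0 - w[i]) * F(lo))) < tol
```

Symmetry (7) can be probed the same way by evaluating a reward on a permuted copy of the belief list, and monotonicity (9) by increasing one coordinate and comparing the two values.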

B. Feature of Accumulative Reward Function 

In this part, we prove some important features of the accumulative reward function $V_t(\Omega(t))$ (also called the value function), which form the basis for the proof of optimality of the greedy policy in the next section.

Lemma 1. (symmetry) $V_t(\Omega(t))$ is symmetric about $\omega_i(t)$, $\omega_j(t)$, $1 \le i, j \le k$, that is,

$$
V_t(\omega_1(t), \ldots, \omega_i(t), \ldots, \omega_j(t), \ldots, \omega_n(t)) = V_t(\omega_1(t), \ldots, \omega_j(t), \ldots, \omega_i(t), \ldots, \omega_n(t)), \quad 1 \le i \ne j \le k
$$

Proof: (1) In time slot $T$, for any $1 \le i \ne j \le k$, since $V_T(\Omega(T)) = F(\Omega(T))$, it follows from Assumption 1 that $V_T(\Omega(T))$ is symmetric.

(2) Assume that $V_{T-1}(\Omega(T-1)), \ldots, V_{t+2}(\Omega(t+2)), V_{t+1}(\Omega(t+1))$ are symmetric. Then at time $t$

$$
V_t(\Omega(t)) = F(\Omega(t)) + \beta K_t(\Omega(t))
$$

By Assumption 1, $F(\Omega(t))$ is symmetric. By Lemma 9 (Appendix A), the second term $K_t(\Omega(t))$ of the above expression is symmetric. Hence, $V_t(\Omega(t))$ is symmetric. ■

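Lemma 2. (affine) $V_t(\Omega(t))$ is an affine function of $\omega_i(t)$, $1 \le i \le n$, when all other $\omega_j(t)$, $j \ne i$, $1 \le j \le n$, are held constant.

Proof: (1) By Assumption 2, in time slot $T$, $F(\Omega(T))$ is an affine function of $\omega_i(T)$, $1 \le i \le n$. Hence, $V_T(\Omega(T)) = F(\Omega(T))$ is also an affine function of $\omega_i(T)$.

(2) Assume that $V_{T-1}(\Omega(T-1)), \ldots, V_{t+2}(\Omega(t+2)), V_{t+1}(\Omega(t+1))$ are affine functions; we prove that the property also holds for slot $t$. Two cases should be considered:

Case 1: channel $\omega_i \notin a^k(t) = \{\omega_1, \ldots, \omega_k\}$:

$$
V_t(\Omega(t)) = F(\Omega(t)) + \beta \sum_{e \in \mathcal{P}(a^k(t))} \prod_{p \in e} \omega_p \prod_{q \in a^k(t) \setminus e} (1 - \omega_q)\, V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_n), p_{01}[k - |e|]\big)
$$

Since $F(\Omega(t))$ does not depend on $\omega_i$, $V_{t+1}(\Omega(t+1))$ is an affine function of $\tau(\omega_i)$ by the induction hypothesis, and $\tau(\omega_i)$ is an affine transform of $\omega_i$, it follows that $V_t(\Omega(t))$ is an affine function of $\omega_i$.

Case 2: channel $\omega_i \in a^k(t)$. Let $a^{k-1}(t) = a^k(t) \setminus \{\omega_i\}$; we have

$$
\begin{aligned}
V_t(\Omega(t)) &= F(\Omega(t)) + \beta \sum_{e \in \mathcal{P}(a^k(t))} \prod_{p \in e} \omega_p \prod_{q \in a^k(t) \setminus e} (1 - \omega_q)\, V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_n), p_{01}[k - |e|]\big) \\
&= F(\omega_1, \ldots, \omega_i, \ldots, \omega_k) + \beta \sum_{m=0}^{k-1} \sum_{\substack{|e| = m \\ e \in \mathcal{P}(a^{k-1}(t))}} \prod_{p \in e} \omega_p \prod_{q \in a^{k-1}(t) \setminus e} (1 - \omega_q) \\
&\qquad \Big[\omega_i V_{t+1}\big(p_{11}[|e|], p_{11}, \tau(\omega_{k+1}), \ldots, \tau(\omega_n), p_{01}[k - 1 - |e|]\big) \\
&\qquad\; + (1 - \omega_i) V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_n), p_{01}, p_{01}[k - 1 - |e|]\big)\Big]
\end{aligned}
$$

By Assumption 2, $F(\omega_1, \ldots, \omega_i, \ldots, \omega_k)$ is an affine function of $\omega_i$. The second term on the right-hand side of the above expression is also an affine function of $\omega_i$. Therefore, $V_t(\Omega(t))$ is an affine function of $\omega_i$. Combining the two cases, $V_t(\Omega(t))$ is an affine function of $\omega_i$, which proves Lemma 2. ■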

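Lemma 3. (monotonicity) $V_t(\Omega(t))$ increases monotonically with $\omega_i$, $1 \le i \le n$, that is,

$$
\omega_i'(t) \ge \omega_i(t) \implies V_t(\omega_1(t), \ldots, \omega_i'(t), \ldots, \omega_n(t)) \ge V_t(\omega_1(t), \ldots, \omega_i(t), \ldots, \omega_n(t)), \quad 1 \le i \le n
$$

Proof: (1) The lemma holds trivially for slot $T$ since $V_T(\Omega(T)) = F(\Omega(T))$, which is an increasing function of $\omega_i$.

(2) Assume that $V_{T-1}(\Omega(T-1)), \ldots, V_{t+2}(\Omega(t+2)), V_{t+1}(\Omega(t+1))$ increase monotonically; we prove that the property holds for slot $t$ by considering two cases.

Case 1: channel $\omega_i \notin a^k(t)$:

$$
V_t(\Omega(t)) = F(\Omega(t)) + \beta \sum_{e \in \mathcal{P}(a^k(t))} \prod_{p \in e} \omega_p \prod_{q \in a^k(t) \setminus e} (1 - \omega_q)\, V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_n), p_{01}[k - |e|]\big)
$$

Since $F(\Omega(t))$ does not depend on $\omega_i$, $V_{t+1}(\Omega(t+1))$ increases with $\tau(\omega_i)$ by the induction hypothesis, and $\tau(\omega_i)$ increases with $\omega_i$ when $p_{11} > p_{01}$, it follows that $V_t(\Omega(t))$ is an increasing function of $\omega_i$.

Case 2: channel $\omega_i \in a^k(t)$. Let $a^{k-1}(t) = a^k(t) \setminus \{\omega_i\}$; we have

$$
\begin{aligned}
V_t(\Omega(t)) &= F(\Omega(t)) + \beta \sum_{e \in \mathcal{P}(a^k(t))} \prod_{p \in e} \omega_p \prod_{q \in a^k(t) \setminus e} (1 - \omega_q)\, V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_n), p_{01}[k - |e|]\big) \\
&= F(\omega_1, \ldots, \omega_i, \ldots, \omega_k) + \beta \sum_{m=0}^{k-1} \sum_{\substack{|e| = m \\ e \in \mathcal{P}(a^{k-1}(t))}} \prod_{p \in e} \omega_p \prod_{q \in a^{k-1}(t) \setminus e} (1 - \omega_q) \\
&\qquad \Big[\omega_i V_{t+1}\big(p_{11}[|e|], p_{11}, \tau(\omega_{k+1}), \ldots, \tau(\omega_n), p_{01}[k - 1 - |e|]\big) \\
&\qquad\; + (1 - \omega_i) V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_n), p_{01}, p_{01}[k - 1 - |e|]\big)\Big] \\
&= F(\omega_1, \ldots, \omega_i, \ldots, \omega_k) + \beta \sum_{m=0}^{k-1} \sum_{\substack{|e| = m \\ e \in \mathcal{P}(a^{k-1}(t))}} \prod_{p \in e} \omega_p \prod_{q \in a^{k-1}(t) \setminus e} (1 - \omega_q) \\
&\qquad \Big[\omega_i \big[V_{t+1}\big(p_{11}[|e|], p_{11}, \tau(\omega_{k+1}), \ldots, \tau(\omega_n), p_{01}[k - 1 - |e|]\big) \\
&\qquad\qquad - V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_n), p_{01}, p_{01}[k - 1 - |e|]\big)\big] \\
&\qquad\; + V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_n), p_{01}, p_{01}[k - 1 - |e|]\big)\Big]
\end{aligned}
$$

The first term, $F(\omega_1, \ldots, \omega_i, \ldots, \omega_k)$, on the right-hand side of the above expression increases monotonically with $\omega_i$, and the second term is also an increasing function of $\omega_i$ because

$$
\begin{aligned}
& V_{t+1}\big(p_{11}[|e|], p_{11}, \tau(\omega_{k+1}), \tau(\omega_{k+2}), \ldots, \tau(\omega_{n-1}), \tau(\omega_n), p_{01}[k - 1 - |e|]\big) \\
&\quad - V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \tau(\omega_{k+2}), \ldots, \tau(\omega_{n-1}), \tau(\omega_n), p_{01}, p_{01}[k - 1 - |e|]\big) \\
&= \big[V_{t+1}\big(p_{11}[|e|], p_{11}, \tau(\omega_{k+1}), \tau(\omega_{k+2}), \ldots, \tau(\omega_{n-1}), \tau(\omega_n), p_{01}[k - 1 - |e|]\big) \\
&\qquad - V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \tau(\omega_{k+1}), \tau(\omega_{k+2}), \ldots, \tau(\omega_{n-1}), \tau(\omega_n), p_{01}[k - 1 - |e|]\big)\big] \\
&\quad + \cdots \\
&\quad + \big[V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \tau(\omega_{k+2}), \ldots, \tau(\omega_{n-1}), \tau(\omega_n), \tau(\omega_n), p_{01}[k - 1 - |e|]\big) \\
&\qquad - V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \tau(\omega_{k+2}), \ldots, \tau(\omega_{n-1}), \tau(\omega_n), p_{01}, p_{01}[k - 1 - |e|]\big)\big] \\
&\ge 0
\end{aligned}
\qquad (10)
$$

where we note that $\tau(\omega_i)$ increases with $\omega_i$ and $p_{01} \le \tau(\omega_i) \le p_{11}$ when $p_{11} > p_{01}$, so each term in brackets is greater than or equal to zero by the induction hypothesis.

Through the two cases, $V_t(\Omega(t))$ increases monotonically with $\omega_i$, which completes the proof. ■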

Lemma 4. $V_t(\Omega(t))$ is a standard reward function.

Proof: It follows directly from the definition and Lemmas 1, 2 and 3 that $V_t(\Omega(t))$ is a standard reward function. ■

In this section, we have analyzed the features of a class of standard reward functions, $V_t(\Omega(t))$, for which the optimality of the greedy policy will be explored in the next section.

IV. Optimality of Greedy Policy for Standard Reward Function 

In this section, we first give the main optimality theorem for the class of standard reward functions, which states a sufficient condition on the discount factor for the optimality of the greedy policy. After introducing some useful lemmas, we give the complete proof of the theorem.

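Let $\omega_{-i}$ denote the belief vector except the $i$th element $\omega_i$, and define

$$
F'_{\max} \triangleq \max_{1 \le i \le k} \Big\{\frac{\partial F(\omega_1(t), \ldots, \omega_i(t), \ldots, \omega_n(t))}{\partial \omega_i(t)}\Big\} = \max_{i \in \mathcal{N},\ \omega_{-i} \in [0,1]^{n-1}} \{F(1, \omega_{-i}) - F(0, \omega_{-i})\},
$$

$$
F'_{\min} \triangleq \min_{1 \le i \le k} \Big\{\frac{\partial F(\omega_1(t), \ldots, \omega_i(t), \ldots, \omega_n(t))}{\partial \omega_i(t)}\Big\} = \min_{i \in \mathcal{N},\ \omega_{-i} \in [0,1]^{n-1}} \{F(1, \omega_{-i}) - F(0, \omega_{-i})\}.
$$

It is easy to verify that $F'_{\max} \ge F'_{\min} \ge 0$ based on the three basic assumptions.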
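Because a standard reward function is affine in each belief, the gain $F(1, \omega_{-i}) - F(0, \omega_{-i})$ is itself affine in every remaining coordinate, so its extremes over a box are attained at the box's corners. The sketch below uses this to evaluate the two quantities by enumeration. The function name `marginal_gain_range` and the parameters `lo` and `hi` (the box in which the other beliefs are assumed to lie) are assumptions of this example, and $F$ is taken as a function of the $k$ selected beliefs only.

```python
from itertools import product
from math import prod


def marginal_gain_range(F, k, lo=0.0, hi=1.0):
    """Return (F'_min, F'_max) for a multi-affine reward F of k selected beliefs.

    The other k-1 beliefs range over the box [lo, hi]^(k-1); only the corners
    are enumerated, which is sufficient when F is affine in each variable.
    """
    gains = []
    for i in range(k):
        for corner in product((lo, hi), repeat=k - 1):
            w = list(corner)
            with_one = w[:i] + [1.0] + w[i:]    # coordinate i set to 1
            with_zero = w[:i] + [0.0] + w[i:]   # coordinate i set to 0
            gains.append(F(with_one) - F(with_zero))
    return min(gains), max(gains)
```

For the "at least one good channel" reward with $k = 3$ and the other beliefs restricted to $[0.2, 0.8]$, the enumeration returns the squares of 0.2 and 0.8, while for the linear reward both values equal 1.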
The main theorem of optimality is stated as follows:

Theorem 1. The myopic policy is optimal for $p_{01} \le \omega_i(0) \le p_{11}$, $1 \le i \le n$, if $F(\Omega(t))$ is a standard reward function and the discount factor $\beta$ satisfies the following condition:

$$
0 \le \beta \le \frac{F'_{\min}}{F'_{\max}\big[1 - (1 - p_{11})^{n-k-1}\big]} \qquad (11)
$$

In order to prove Theorem 1, we first introduce some useful lemmas. Note that Lemmas 5, 6 and 7 hold under condition (11) in the rest of the paper.

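Lemma 5. If $k + 1 \le i \le n - 1$, $p_{11} \ge \omega_i \ge \omega_{i+1} \ge p_{01}$, and (11) is satisfied, then

$$
V_t(\omega_1, \ldots, \omega_k, \ldots, \omega_i, \omega_{i+1}, \ldots, \omega_n) - V_t(\omega_1, \ldots, \omega_k, \ldots, \omega_{i+1}, \omega_i, \ldots, \omega_n) \ge 0, \quad t = 1, \ldots, T. \qquad (12)
$$

Lemma 6. For $1 \ge \omega_1(t) \ge \omega_2(t) \ge \cdots \ge \omega_n(t) \ge 0$, if (11) is satisfied, we have the following inequality for all $t = 1, 2, \ldots, T$:

$$
V_t(\omega_1, \ldots, \omega_k, \ldots, \omega_{n-1}, \omega_n) - V_t(\omega_n, \omega_1, \ldots, \omega_k, \ldots, \omega_{n-1}) \le F'_{\max}, \quad t = 1, \ldots, T. \qquad (13)
$$

Lemma 7. If $p_{11} \ge x \ge y \ge p_{01}$ and (11) is satisfied, then

$$
V_t(\omega_1, \ldots, \omega_{k-1}, x, y, \ldots, \omega_n) - V_t(\omega_1, \ldots, \omega_{k-1}, y, x, \ldots, \omega_n) \ge 0, \quad t = 1, \ldots, T. \qquad (14)
$$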

Remark. We would like to point out the intricate dependence among the following proofs: Lemma 5 depends on Lemmas 2, 6 and 7; Lemma 6 depends on Lemmas 6 and 7; and Lemma 7 depends on Lemmas 7 and 6. Therefore, we prove Lemmas 5, 6 and 7 together by backward induction over the time horizon.

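Proof: The proof proceeds by backward induction in three steps:

• step 1: slot $T$:

These lemmas hold trivially in slot $T$, noting that $V_T(\Omega(T)) = F(\Omega(T))$.

part 1: Lemma 5

$$
\begin{aligned}
& V_T(\omega_1, \ldots, \omega_k, \ldots, \omega_i, \omega_{i+1}, \ldots, \omega_n) - V_T(\omega_1, \ldots, \omega_k, \ldots, \omega_{i+1}, \omega_i, \ldots, \omega_n) \\
&= F(\omega_1, \ldots, \omega_k) - F(\omega_1, \ldots, \omega_k) = 0
\end{aligned}
$$

part 2: Lemma 6

$$
\begin{aligned}
& V_T(\omega_1, \ldots, \omega_k, \ldots, \omega_{n-1}, \omega_n) - V_T(\omega_n, \omega_1, \ldots, \omega_k, \ldots, \omega_{n-1}) \\
&= F(\omega_1, \ldots, \omega_{k-1}, \omega_k) - F(\omega_n, \omega_1, \ldots, \omega_{k-1}) \\
&= (\omega_k - \omega_n)\big(F(\omega_1, \ldots, \omega_{k-1}, 1) - F(\omega_1, \ldots, \omega_{k-1}, 0)\big) \le F'_{\max}
\end{aligned}
$$

where the second equality is due to Lemmas 1 and 2.

part 3: Lemma 7

$$
\begin{aligned}
& V_T(\omega_1, \ldots, \omega_{k-1}, x, y, \ldots, \omega_n) - V_T(\omega_1, \ldots, \omega_{k-1}, y, x, \ldots, \omega_n) \\
&= F(\omega_1, \ldots, \omega_{k-1}, x) - F(\omega_1, \ldots, \omega_{k-1}, y) \\
&= (x - y)\big(F(\omega_1, \ldots, \omega_{k-1}, 1) - F(\omega_1, \ldots, \omega_{k-1}, 0)\big) \\
&\ge (x - y) F'_{\min} \ge 0
\end{aligned}
$$

• step 2: slots $t + 1, \ldots, T - 1$:

Now suppose that at slots $t + 1, \ldots, T - 1$, Lemma 5 (Induction Hypothesis 1, IH1), Lemma 6 (Induction Hypothesis 2, IH2), and Lemma 7 (Induction Hypothesis 3, IH3) hold; we then prove that these lemmas also hold in slot $t$.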

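• step 3: slot $t$:

part 1: Lemma 5

$$
\begin{aligned}
& V_t(\omega_1, \ldots, \omega_k, \ldots, \omega_i, \omega_{i+1}, \ldots, \omega_n) - V_t(\omega_1, \ldots, \omega_k, \ldots, \omega_{i+1}, \omega_i, \ldots, \omega_n) \\
&= (\omega_i - \omega_{i+1})\big(V_t(\omega_1, \ldots, \omega_{i-1}, 1, 0, \omega_{i+2}, \ldots, \omega_n) - V_t(\omega_1, \ldots, \omega_{i-1}, 0, 1, \omega_{i+2}, \ldots, \omega_n)\big) \\
&= (\omega_i - \omega_{i+1})\Big\{F(\omega_1, \ldots, \omega_k) + \beta \sum_{e \in \mathcal{P}(a^k(t))} \prod_{p \in e} \omega_p \prod_{q \in a^k(t) \setminus e} (1 - \omega_q) \\
&\qquad V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_{i-1}), p_{11}, p_{01}, \tau(\omega_{i+2}), \ldots, \tau(\omega_n), p_{01}[k - |e|]\big)\Big\} \\
&\quad - (\omega_i - \omega_{i+1})\Big\{F(\omega_1, \ldots, \omega_k) + \beta \sum_{e \in \mathcal{P}(a^k(t))} \prod_{p \in e} \omega_p \prod_{q \in a^k(t) \setminus e} (1 - \omega_q) \\
&\qquad V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_{i-1}), p_{01}, p_{11}, \tau(\omega_{i+2}), \ldots, \tau(\omega_n), p_{01}[k - |e|]\big)\Big\} \\
&= (\omega_i - \omega_{i+1})\, \beta \sum_{e \in \mathcal{P}(a^k(t))} \prod_{p \in e} \omega_p \prod_{q \in a^k(t) \setminus e} (1 - \omega_q) \\
&\qquad \Big\{V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_{i-1}), p_{11}, p_{01}, \tau(\omega_{i+2}), \ldots, \tau(\omega_n), p_{01}[k - |e|]\big) \\
&\qquad\; - V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_{i-1}), p_{01}, p_{11}, \tau(\omega_{i+2}), \ldots, \tau(\omega_n), p_{01}[k - |e|]\big)\Big\} \\
&\ge 0
\end{aligned}
$$

where $a^k(t) = \{\omega_1, \ldots, \omega_k\}$, the first equality is due to Lemma 2, and the inequality is due to IH1 if $|e| + i - k - 1 \ge k$, to IH3 if $|e| + i - k - 1 = k - 1$, and to Lemma 1 if $|e| + i - k - 1 < k - 1$.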
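part 2: Lemma 6

We have the following decomposition according to Lemma 2:

$$
\begin{aligned}
& V_t(\omega_1, \ldots, \omega_k, \ldots, \omega_{n-1}, \omega_n) - V_t(\omega_n, \omega_1, \ldots, \omega_k, \ldots, \omega_{n-1}) \\
&= \omega_k \omega_n \big(V_t(\omega_1, \omega_2, \ldots, \omega_{k-1}, 1, \omega_{k+1}, \ldots, \omega_{n-1}, 1) - V_t(1, \omega_1, \omega_2, \ldots, \omega_{k-1}, 1, \omega_{k+1}, \ldots, \omega_{n-1})\big) \\
&\quad + \omega_k (1 - \omega_n) \big(V_t(\omega_1, \omega_2, \ldots, \omega_{k-1}, 1, \omega_{k+1}, \ldots, \omega_{n-1}, 0) - V_t(0, \omega_1, \omega_2, \ldots, \omega_{k-1}, 1, \omega_{k+1}, \ldots, \omega_{n-1})\big) \\
&\quad + (1 - \omega_k) \omega_n \big(V_t(\omega_1, \omega_2, \ldots, \omega_{k-1}, 0, \omega_{k+1}, \ldots, \omega_{n-1}, 1) - V_t(1, \omega_1, \omega_2, \ldots, \omega_{k-1}, 0, \omega_{k+1}, \ldots, \omega_{n-1})\big) \\
&\quad + (1 - \omega_k)(1 - \omega_n) \big(V_t(\omega_1, \omega_2, \ldots, \omega_{k-1}, 0, \omega_{k+1}, \ldots, \omega_{n-1}, 0) - V_t(0, \omega_1, \omega_2, \ldots, \omega_{k-1}, 0, \omega_{k+1}, \ldots, \omega_{n-1})\big)
\end{aligned}
$$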
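We therefore analyze the above expression through four cases.

Case 1. Consider the first term on the right-hand side, where channels $k$ and $n$ have the state realizations "1" and "1", respectively. Denoting $a^{k-1}(t) = \{\omega_1, \omega_2, \ldots, \omega_{k-1}\}$, we have

$$
\begin{aligned}
& V_t(\omega_1, \omega_2, \ldots, \omega_{k-1}, 1, \omega_{k+1}, \ldots, \omega_{n-1}, 1) - V_t(1, \omega_1, \omega_2, \ldots, \omega_{k-1}, 1, \omega_{k+1}, \ldots, \omega_{n-1}) \\
&= F(\omega_1, \omega_2, \ldots, \omega_{k-1}, 1) - F(1, \omega_1, \omega_2, \ldots, \omega_{k-1}) \\
&\quad + \beta \sum_{e \in \mathcal{P}(a^{k-1}(t))} \prod_{i \in e} \omega_i \prod_{j \in a^{k-1}(t) \setminus e} (1 - \omega_j) \\
&\qquad \Big\{V_{t+1}\big(p_{11}[|e|], p_{11}, \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), \tau(\omega_n), p_{01}[k - 1 - |e|]\big) \\
&\qquad\; - V_{t+1}\big(p_{11}[|e|], p_{11}, \tau(\omega_k), \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}[k - 1 - |e|]\big)\Big\} \\
&= \beta \sum_{e \in \mathcal{P}(a^{k-1}(t))} \prod_{i \in e} \omega_i \prod_{j \in a^{k-1}(t) \setminus e} (1 - \omega_j) \\
&\qquad \Big\{V_{t+1}\big(p_{11}[|e|], p_{11}, \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{11}, p_{01}[k - 1 - |e|]\big) \\
&\qquad\; - V_{t+1}\big(p_{11}[|e|], p_{11}, p_{11}, \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}[k - 1 - |e|]\big)\Big\} \\
&\le 0
\end{aligned}
$$

where the first inequality is due to Lemma 3, applied in the same way as in (10).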

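Case 2. Consider the second term on the right-hand side, where channels $k$ and $n$ have the state realizations "1" and "0", respectively. Denoting $a^{k-1}(t) = \{\omega_1, \omega_2, \ldots, \omega_{k-1}\}$, we have

$$
\begin{aligned}
& V_t(\omega_1, \omega_2, \ldots, \omega_{k-1}, 1, \omega_{k+1}, \ldots, \omega_{n-1}, 0) - V_t(0, \omega_1, \omega_2, \ldots, \omega_{k-1}, 1, \omega_{k+1}, \ldots, \omega_{n-1}) \\
&= F(\omega_1, \omega_2, \ldots, \omega_{k-1}, 1) - F(0, \omega_1, \omega_2, \ldots, \omega_{k-1}) \\
&\quad + \beta \sum_{e \in \mathcal{P}(a^{k-1}(t))} \prod_{i \in e} \omega_i \prod_{j \in a^{k-1}(t) \setminus e} (1 - \omega_j) \\
&\qquad \Big\{V_{t+1}\big(p_{11}[|e|], p_{11}, \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}, p_{01}[k - 1 - |e|]\big) \\
&\qquad\; - V_{t+1}\big(p_{11}[|e|], p_{11}, \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}, p_{01}[k - 1 - |e|]\big)\Big\} \\
&= F(\omega_1, \omega_2, \ldots, \omega_{k-1}, 1) - F(0, \omega_1, \omega_2, \ldots, \omega_{k-1}) \\
&\le F'_{\max}
\end{aligned}
$$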

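Case 3. Consider the third term on the right-hand side, where channels $k$ and $n$ have the state realizations "0" and "1", respectively. Denoting $a^{k-1}(t) = \{\omega_1, \omega_2, \ldots, \omega_{k-1}\}$, we have

$$
\begin{aligned}
& V_t(\omega_1, \omega_2, \ldots, \omega_{k-1}, 0, \omega_{k+1}, \ldots, \omega_{n-1}, 1) - V_t(1, \omega_1, \omega_2, \ldots, \omega_{k-1}, 0, \omega_{k+1}, \ldots, \omega_{n-1}) \\
&= F(\omega_1, \omega_2, \ldots, \omega_{k-1}, 0) - F(1, \omega_1, \omega_2, \ldots, \omega_{k-1}) \\
&\quad + \beta \sum_{e \in \mathcal{P}(a^{k-1}(t))} \prod_{i \in e} \omega_i \prod_{j \in a^{k-1}(t) \setminus e} (1 - \omega_j) \\
&\qquad \Big\{V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{11}, p_{01}, p_{01}[k - 1 - |e|]\big) \\
&\qquad\; - V_{t+1}\big(p_{11}[|e|], p_{11}, p_{01}, \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}[k - 1 - |e|]\big)\Big\} \\
&\le F(\omega_1, \omega_2, \ldots, \omega_{k-1}, 0) - F(1, \omega_1, \omega_2, \ldots, \omega_{k-1}) \\
&\quad + \beta \sum_{e \in \mathcal{P}(a^{k-1}(t))} \prod_{i \in e} \omega_i \prod_{j \in a^{k-1}(t) \setminus e} (1 - \omega_j) \\
&\qquad \Big\{V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{11}, p_{01}, p_{01}[k - 1 - |e|]\big) \\
&\qquad\; - V_{t+1}\big(p_{11}[|e|], p_{01}, p_{11}, \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}[k - 1 - |e|]\big)\Big\} \\
&= F(\omega_1, \omega_2, \ldots, \omega_{k-1}, 0) - F(1, \omega_1, \omega_2, \ldots, \omega_{k-1}) \\
&\quad + \beta \sum_{e \in \mathcal{P}(a^{k-1}(t))} \prod_{i \in e} \omega_i \prod_{j \in a^{k-1}(t) \setminus e} (1 - \omega_j) \\
&\qquad \Big\{V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{11}, p_{01}, p_{01}[k - 1 - |e|]\big) \\
&\qquad\; - V_{t+1}\big(p_{01}, p_{11}[|e|], p_{11}, \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}[k - 1 - |e|]\big)\Big\} \\
&\le F(\omega_1, \omega_2, \ldots, \omega_{k-1}, 0) - F(1, \omega_1, \omega_2, \ldots, \omega_{k-1}) \\
&\quad + \beta \sum_{e \in \mathcal{P}(a^{k-1}(t))} \prod_{i \in e} \omega_i \prod_{j \in a^{k-1}(t) \setminus e} (1 - \omega_j) \\
&\qquad \Big\{V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{11}, p_{01}, p_{01}[k - 1 - |e|]\big) \\
&\qquad\; + F'_{\max} - V_{t+1}\big(p_{11}[|e|], p_{11}, \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}, p_{01}[k - 1 - |e|]\big)\Big\} \\
&\le \beta F'_{\max} \le F'_{\max}
\end{aligned}
$$

where the first inequality is due to IH3 when $|e| + 1 = k$, the second inequality is due to IH2, and the second equality is due to Lemma 1 when $|e| + 1 < k$, noting that $0 \le |e| \le k - 1$.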

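Case 4. Consider the fourth term on the right-hand side, where channels $k$ and $n$ have the state realizations "0" and "0", respectively. Denoting $a^{k-1}(t) = \{\omega_1, \omega_2, \ldots, \omega_{k-1}\}$, we have

$$
\begin{aligned}
& V_t(\omega_1, \omega_2, \ldots, \omega_{k-1}, 0, \omega_{k+1}, \ldots, \omega_{n-1}, 0) - V_t(0, \omega_1, \omega_2, \ldots, \omega_{k-1}, 0, \omega_{k+1}, \ldots, \omega_{n-1}) \\
&= F(\omega_1, \omega_2, \ldots, \omega_{k-1}, 0) - F(0, \omega_1, \omega_2, \ldots, \omega_{k-1}) \\
&\quad + \beta \sum_{e \in \mathcal{P}(a^{k-1}(t))} \prod_{i \in e} \omega_i \prod_{j \in a^{k-1}(t) \setminus e} (1 - \omega_j) \\
&\qquad \Big\{V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}, p_{01}, p_{01}[k - 1 - |e|]\big) \\
&\qquad\; - V_{t+1}\big(p_{11}[|e|], p_{01}, \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}, p_{01}[k - 1 - |e|]\big)\Big\} \\
&= \beta \sum_{e \in \mathcal{P}(a^{k-1}(t))} \prod_{i \in e} \omega_i \prod_{j \in a^{k-1}(t) \setminus e} (1 - \omega_j) \\
&\qquad \Big\{V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}, p_{01}, p_{01}[k - 1 - |e|]\big) \\
&\qquad\; - V_{t+1}\big(p_{11}[|e|], p_{01}, \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}, p_{01}[k - 1 - |e|]\big)\Big\} \\
&= \beta \sum_{e \in \mathcal{P}(a^{k-1}(t))} \prod_{i \in e} \omega_i \prod_{j \in a^{k-1}(t) \setminus e} (1 - \omega_j) \\
&\qquad \Big\{V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-2}), \tau(\omega_{n-1}), p_{01}, p_{01}, p_{01}[k - 1 - |e|]\big) \\
&\qquad\; - V_{t+1}\big(p_{01}, p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}, p_{01}[k - 1 - |e|]\big)\Big\} \\
&\le \beta \sum_{e \in \mathcal{P}(a^{k-1}(t))} \prod_{i \in e} \omega_i \prod_{j \in a^{k-1}(t) \setminus e} (1 - \omega_j) \\
&\qquad \Big\{V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \tau(\omega_{k+2}), \ldots, \tau(\omega_{n-2}), \tau(\omega_{n-1}), p_{01}, p_{01}, p_{01}[k - 1 - |e|]\big) \\
&\qquad\; + F'_{\max} - V_{t+1}\big(p_{11}[|e|], \tau(\omega_{k+1}), \ldots, \tau(\omega_{n-1}), p_{01}, p_{01}, p_{01}[k - 1 - |e|]\big)\Big\} \\
&\le \beta F'_{\max}
\end{aligned}
$$

where the first inequality is due to IH2 and the third equality is due to Lemma 1.

Combining the results of Cases 1, 2, 3, and 4, we have

$$
V_t(\omega_1, \ldots, \omega_k, \ldots, \omega_{n-1}, \omega_n) - V_t(\omega_n, \omega_1, \ldots, \omega_k, \ldots, \omega_{n-1}) \le F'_{\max}
$$

This completes the proof of Lemma 6.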
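part 3: Lemma 7

$$
\begin{aligned}
& V_t(\omega_1, \ldots, \omega_{k-1}, x, y, \ldots, \omega_n) - V_t(\omega_1, \ldots, \omega_{k-1}, y, x, \ldots, \omega_n) \\
&= (x - y)\big(V_t(\omega_1, \ldots, \omega_{k-1}, 1, 0, \ldots, \omega_n) - V_t(\omega_1, \ldots, \omega_{k-1}, 0, 1, \ldots, \omega_n)\big) \\
&= (x - y)\big(F(\omega_1, \ldots, \omega_{k-1}, 1) - F(\omega_1, \ldots, \omega_{k-1}, 0)\big) \\
&\quad + (x - y)\, \beta \sum_{e \in \mathcal{P}(a^{k-1}(t))} \prod_{p \in e} \omega_p \prod_{q \in a^{k-1}(t) \setminus e} (1 - \omega_q) \\
&\qquad \Big\{V_{t+1}\big(p_{11}[|e|], p_{11}, p_{01}, \tau(\omega_{k+2}), \ldots, \tau(\omega_n), p_{01}[k - 1 - |e|]\big) \\
&\qquad\; - V_{t+1}\big(p_{11}[|e|], p_{11}, \tau(\omega_{k+2}), \ldots, \tau(\omega_n), p_{01}, p_{01}[k - 1 - |e|]\big)\Big\} \\
&\ge (x - y)\big(F(\omega_1, \ldots, \omega_{k-1}, 1) - F(\omega_1, \ldots, \omega_{k-1}, 0)\big) - \beta (x - y)\Big(1 - \prod_{j=k+2}^{n} (1 - \omega_j)\Big) F'_{\max} \\
&\ge (x - y) F'_{\min} - \beta (x - y)\Big(1 - \prod_{j=k+2}^{n} (1 - \omega_j)\Big) F'_{\max} \\
&= (x - y)\Big(1 - \prod_{j=k+2}^{n} (1 - \omega_j)\Big) F'_{\max} \Bigg(\frac{F'_{\min}}{F'_{\max}\big(1 - \prod_{j=k+2}^{n} (1 - \omega_j)\big)} - \beta\Bigg) \\
&\ge 0
\end{aligned}
$$

where the third inequality is due to condition (11) and the first inequality is due to the following inequality:

$$
\begin{aligned}
\Delta V &= V_{t+1}\big(p_{11}[|e|], p_{11}, p_{01}, \tau(\omega_{k+2}), \ldots, \tau(\omega_n), p_{01}[k - 1 - |e|]\big) \\
&\quad - V_{t+1}\big(p_{11}[|e|], p_{11}, \tau(\omega_{k+2}), \ldots, \tau(\omega_n), p_{01}, p_{01}[k - 1 - |e|]\big) \\
&\ge -\Big(1 - \prod_{j=k+2}^{n} (1 - \omega_j)\Big) F'_{\max}
\end{aligned}
\qquad (15)
$$

Note that if $\tau(\omega_{k+2}) = \cdots = \tau(\omega_n) = p_{01}$, then $\Delta V = 0$. This event happens with probability $\prod_{j=k+2}^{n} (1 - \omega_j)$. Thus, with probability $1 - \prod_{j=k+2}^{n} (1 - \omega_j)$, there exists at least one $i$, $k + 2 \le i \le n$, such that $\tau(\omega_i) > p_{01}$. According to IH2 and IH4, we have $\Delta V \ge -F'_{\max}$ with probability $1 - \prod_{j=k+2}^{n} (1 - \omega_j)$, which gives (15).

This completes the proof of Lemmas 5, 6 and 7. ■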

Having obtained Lemmas 5, 6 and 7, we are ready to prove Theorem 1.

Proof: The basic approach is induction on $t$. It is obvious that the myopic policy is optimal at $T$. Now, assuming the optimality of the myopic policy for $t + 1, \ldots, T - 1$, we show that the myopic policy is also optimal for $t$. Let $(i_1, \ldots, i_n)$ denote any permutation of $\mathcal{N}$. To prove the optimality of the greedy policy in slot $t$, we need to prove

$$
V_t(\omega_1, \ldots, \omega_k, \ldots, \omega_n) \ge V_t(\omega_{i_1}, \ldots, \omega_{i_k}, \ldots, \omega_{i_n}) \qquad (16)
$$

The proof proceeds in the same way as the bubble sort algorithm: we compare each pair of adjacent items and swap them if they are in the wrong order, where by Lemmas 1, 5 and 7 each swap does not decrease the value, and repeat until no swaps are needed, at which point the list is sorted to $V_t(\omega_1, \ldots, \omega_k, \ldots, \omega_n)$. The optimality of the greedy policy at slot $t$ is thus guaranteed, which proves Theorem 1. ■
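The bubble sort argument can be traced mechanically. The sketch below sorts a belief vector into descending order using adjacent swaps only and labels each swap with the lemma that, on our reading of the proof, justifies it: Lemma 1 when both positions lie inside the selected set, Lemma 7 at the boundary between positions $k$ and $k+1$, and Lemma 5 when both positions lie outside. The function name `bubble_sort_trace` and this mapping are assumptions of the example; the code does not evaluate $V_t$.

```python
def bubble_sort_trace(beliefs, k):
    """Sort beliefs in descending order by adjacent swaps.

    Returns the sorted list and a trace of (position, lemma) pairs, where
    position is the 1-based index of the left element of each swapped pair.
    """
    w = list(beliefs)
    trace = []
    n = len(w)
    swapped = True
    while swapped:
        swapped = False
        for j in range(n - 1):  # 0-based j compares positions j+1 and j+2
            if w[j] < w[j + 1]:
                w[j], w[j + 1] = w[j + 1], w[j]
                if j + 2 <= k:
                    lemma = "Lemma 1"  # both positions inside the selected set
                elif j + 1 == k:
                    lemma = "Lemma 7"  # boundary between selected and unselected
                else:
                    lemma = "Lemma 5"  # both positions outside the selected set
                trace.append((j + 1, lemma))
                swapped = True
    return w, trace
```

Because every permutation can be sorted by finitely many adjacent swaps, chaining the corresponding inequalities yields (16).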

Corollary 1. The greedy policy is optimal when choosing 1 out of $n$ channels for $0 \le \beta \le 1$ if $p_{11} > p_{01}$.

Proof: When $k = 1$, according to Lemmas 1, 2 and 3, we have $F(\Omega(t)) = a\,\omega_i(t)$, $a > 0$. Hence $F'_{\min} = F'_{\max} = a$ and

$$
\frac{F'_{\min}}{F'_{\max}\big[1 - (1 - p_{11})^{n-2}\big]} \ge 1 \qquad (17)
$$

According to Theorem 1, we have the conclusion. ■

Corollary 2. The greedy policy is optimal when choosing $n - 1$ out of $n$ channels for $0 \le \beta \le 1$.

Proof: In the case of $k = n - 1$, we have

$$
\frac{F'_{\min}}{F'_{\max}\big[1 - (1 - p_{11})^{n-k-1}\big]}\Bigg|_{k = n-1} \to \infty \qquad (18)
$$

Hence, the greedy policy is optimal according to Theorem 1. ■

V. Applications in Cognitive Radio Network 

To illustrate the application of the mathematical results derived in the previous section, three typical scenarios [8], [9] described by standard reward functions are presented here. They demonstrate that the different optimality conditions arise entirely from the different forms of the immediate reward function.

A. Application 1 

One application is a synchronously slotted cognitive radio network in which a secondary user (SU) can opportunistically access a set $\mathcal{N}$ of $N$ i.i.d. channels partially occupied by primary users (PUs). The state of each channel $i$ in time slot $t$, denoted by $S_i(t)$, is modeled by a discrete-time two-state Markov chain. At the beginning of each slot $t$, the SU selects a subset $A(t)$ of channels to sense. If at least one of the sensed channels is in the idle state (i.e., unoccupied by any PU), the SU transmits its packet and collects one unit of reward. Otherwise, the SU cannot transmit and thus obtains no reward. This decision procedure is repeated for each slot. The objective is to maximize the average reward over $T$ slots, that is to say, the discount factor is $\beta = 1$.

Obviously, we have the immediate reward function as follows:

$$F(\Omega(t)) = 1 - \prod_{i=1}^{k} \bigl(1 - \omega_i(t)\bigr).$$

Therefore, the greedy policy is to choose the $k$ channels with the largest $\omega_i(t)$. According to Theorem 1, we have $F'_{\max} = (1-p_{01})^{k-1}$ and $F'_{\min} = (1-p_{11})^{k-1}$ if $p_{01} \le \omega_i(0) \le p_{11}$, $1 \le i \le n$. Therefore the greedy policy, choosing the best $k$ out of $n$ channels, is optimal if the discounted factor $\beta$ satisfies the following condition:

$$0 \le \beta \le \frac{(1-p_{11})^{k-1}}{(1-p_{01})^{k-1}\left(1-(1-p_{11})^{n-k-1}\right)}.$$

Obviously, the upper bound is in general smaller than 1. Thus, the greedy policy is, in general, not optimal for the average reward over the time horizon, as proved in our previous work [9]. In particular, the greedy policy of choosing the best $k = 1$ or $k = n-1$ out of $n$ channels is optimal for $\beta = 1$ according to Corollaries 1 and 2.
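
As a numerical illustration, the following Python sketch evaluates the upper bound on $\beta$ for this reward function. It assumes that the condition has the form $\beta \le F'_{\min} / \bigl(F'_{\max}\,(1-(1-p_{11})^{n-k-1})\bigr)$ with $F'_{\max} = (1-p_{01})^{k-1}$ and $F'_{\min} = (1-p_{11})^{k-1}$; the function name `beta_upper_bound` and the parameter values are hypothetical.

```python
import math


def beta_upper_bound(n, k, p11, p01):
    """Assumed bound F'_min / (F'_max * (1 - (1 - p11)**(n - k - 1)))."""
    f_max = (1.0 - p01) ** (k - 1)   # largest partial derivative of F
    f_min = (1.0 - p11) ** (k - 1)   # smallest partial derivative of F
    gap = 1.0 - (1.0 - p11) ** (n - k - 1)
    if gap == 0.0:                   # occurs when k = n - 1
        return math.inf
    return f_min / (f_max * gap)


# Hypothetical example values with p11 > p01.
n, p11, p01 = 5, 0.8, 0.2
for k in range(1, n):
    print(k, beta_upper_bound(n, k, p11, p01))
```

For these values the bound exceeds 1 only for $k = 1$ and $k = n - 1$. A bound below 1 means only that the sufficient condition does not cover $\beta = 1$; the sketch by itself does not show that the greedy policy fails.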

B. Application 2 

Consider the problem of probing $n$ independent Markov chains. Each chain has two states, good (1) and bad (0), with transition probabilities $p_{11}, p_{01}$ common to all chains, and we assume $p_{11} > p_{01}$. A player selects $k$ chains to probe according to its preference (policy) and obtains a reward for each probed chain in the good state. We assume that the reward is an affine function of the probability that the selected channel is in the good state, i.e., $U_i(t) = a\,\omega_i(t)$, $a > 0$. Then we have the immediate reward function as follows:

$$F(\Omega(t)) = a \sum_{i=1}^{k} \omega_i(t).$$

Since $F'_{\max} = F'_{\min} = a$, we have the following conclusion about this problem by Theorem 1.

Lemma 8. The greedy policy of choosing the $k$ best channels is optimal for $0 \le \beta \le 1$.

Obviously, this result is consistent with [7], [?].
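
The statement for the additive reward can be checked by brute force on a small instance. The Python sketch below is a minimal illustration under stated assumptions, not the authors' method: it assumes that probed channels are observed perfectly (belief reset to $p_{11}$ or $p_{01}$), that unprobed channels evolve by $\tau(\omega) = \omega p_{11} + (1-\omega) p_{01}$, and that the horizon is finite. The function names and parameter values are hypothetical.

```python
from itertools import combinations, product


def tau(w, p11, p01):
    """One-step belief update for a channel that is not probed."""
    return w * p11 + (1.0 - w) * p01


def value(omega, t, T, k, p11, p01, beta, a, greedy):
    """Expected discounted reward from slot t to T for F = a * sum of probed beliefs."""
    if t > T:
        return 0.0
    n = len(omega)
    if greedy:
        # Greedy: probe the k channels with the largest current beliefs.
        order = sorted(range(n), key=lambda i: omega[i], reverse=True)
        actions = [tuple(order[:k])]
    else:
        # Optimal: maximize over every k-subset of channels.
        actions = list(combinations(range(n), k))
    best = float("-inf")
    for A in actions:
        total = a * sum(omega[i] for i in A)
        # Enumerate the observed states (1 = good, 0 = bad) of the probed channels.
        for states in product((0, 1), repeat=k):
            prob = 1.0
            nxt = [tau(w, p11, p01) for w in omega]
            for i, s in zip(A, states):
                prob *= omega[i] if s else 1.0 - omega[i]
                nxt[i] = p11 if s else p01
            future = value(tuple(nxt), t + 1, T, k, p11, p01, beta, a, greedy)
            total += beta * prob * future
        best = max(best, total)
    return best


# Hypothetical instance: n = 4, k = 2, p11 > p01, initial beliefs in [p01, p11].
omega0 = (0.3, 0.7, 0.5, 0.6)
v_opt = value(omega0, 1, 4, 2, 0.8, 0.2, 1.0, 1.0, greedy=False)
v_greedy = value(omega0, 1, 4, 2, 0.8, 0.2, 1.0, 1.0, greedy=True)
print(v_opt, v_greedy)
```

On this instance the two values are expected to agree up to floating-point error, in line with Lemma 8. The enumeration grows exponentially in the horizon, so the check is practical only for small $n$, $k$, and $T$.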
C. Application 3 

Consider the scenario where a player detects $n$ independent Markov chains. Each chain has two states, good (1) and bad (0), with transition probabilities $p_{11}, p_{01}$ ($p_{11} > p_{01}$) common to all chains. The player selects $k$ chains to detect according to its policy and obtains one unit of reward if all detected channels are good; otherwise, it obtains no reward. Let $\omega_i(t)$ denote the probability that channel $i$ is in the good state at time $t$; then we have the immediate reward function as follows:

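$$F(\Omega(t)) = \prod_{i=1}^{k} \omega_i(t).$$

Therefore, the greedy policy is to detect the $k$ best channels, and $F'_{\max} = p_{11}^{k-1}$, $F'_{\min} = p_{01}^{k-1}$. By Theorem 1, the greedy policy is optimal if

$$0 \le \beta \le \frac{p_{01}^{k-1}}{p_{11}^{k-1}\left(1-(1-p_{11})^{n-k-1}\right)}.$$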

So in the case $1 < k < n-1$ the greedy policy is in general not optimal for $\beta = 1$, while choosing the best $k = 1$ or $k = n-1$ out of $n$ channels is optimal for $0 \le \beta \le 1$.
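
The dependence on $k$ can be tabulated with a few lines of Python. The sketch assumes the bound $\beta \le F'_{\min} / \bigl(F'_{\max}\,(1-(1-p_{11})^{n-k-1})\bigr)$ with $F'_{\max} = p_{11}^{k-1}$ and $F'_{\min} = p_{01}^{k-1}$; the function names and the parameter values are hypothetical.

```python
import math


def beta_bound_all_good(n, k, p11, p01):
    """Assumed upper bound on beta for the 'all detected channels good' reward."""
    f_max = p11 ** (k - 1)
    f_min = p01 ** (k - 1)
    gap = 1.0 - (1.0 - p11) ** (n - k - 1)
    # gap vanishes for k = n - 1, where the bound is unbounded.
    return math.inf if gap == 0.0 else f_min / (f_max * gap)


def ks_covering_beta_one(n, p11, p01):
    """Values of k for which the assumed bound is at least 1."""
    return [k for k in range(1, n) if beta_bound_all_good(n, k, p11, p01) >= 1.0]


# Hypothetical example values with p11 > p01.
print(ks_covering_beta_one(6, 0.8, 0.2))
```

For $n = 6$, $p_{11} = 0.8$, and $p_{01} = 0.2$, only $k = 1$ and $k = 5$ are returned, which matches the two cases singled out above.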

VI. Conclusion

In this paper, we considered a class of POMDP problems arising in cognitive radio networks, server scheduling, and downlink scheduling in cellular systems, characterized by the so-called standard reward function. For this class of POMDPs, we established the condition under which the greedy policy, which focuses only on maximizing the immediate reward, is optimal. The technical approach used to analyze this problem is purely mathematical and is thus applicable to other models involving recursive backward induction over the time horizon. One future direction is to investigate the non-i.i.d. Markov chain model with the proposed method; another, more challenging, direction is to extend the standard reward function by dropping at least one of its three basic assumptions.

Appendix A 
Proof of Lemma 9

Lemma 9. Assume $a^k(t) = \{\omega_1(t), \ldots, \omega_k(t)\}$. Then $K_t(\Omega(t))$ is symmetric about $\omega_i(t), \omega_j(t)$ for all $1 \le i, j \le k$, that is,

$$K_t(\omega_1(t), \ldots, \omega_i(t), \ldots, \omega_j(t), \ldots, \omega_n(t)) = K_t(\omega_1(t), \ldots, \omega_j(t), \ldots, \omega_i(t), \ldots, \omega_n(t)).$$
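Proof: Let

$$K_t(m) = \sum_{\mathcal{E} \subseteq a^k(t),\ |\mathcal{E}| = m}\ \prod_{i \in \mathcal{E}} \omega_i \prod_{j \in a^k(t) \setminus \mathcal{E}} (1-\omega_j)\; V_{t+1}\bigl(p_{11}[|\mathcal{E}|], \tau(\omega_{k+1}), \ldots, \tau(\omega_n), p_{01}[k-|\mathcal{E}|]\bigr). \qquad (19)$$

Therefore,

$$K_t(\Omega(t)) = \sum_{m=0}^{k} K_t(m). \qquad (20)$$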

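Since $V_{t+1}\bigl(p_{11}[|\mathcal{E}|], \tau(\omega_{k+1}), \ldots, \tau(\omega_n), p_{01}[k-|\mathcal{E}|]\bigr)$ is unrelated to $a^k(t)$, we only need to prove that the $k+1$ coefficients are symmetric about $\omega_i(t), \omega_j(t)$ for all $1 \le i, j \le k$, that is, that

$$C_t^m = \sum_{|\mathcal{E}| = m}\ \prod_{i \in \mathcal{E}} \omega_i \prod_{j \in a^k(t) \setminus \mathcal{E}} (1-\omega_j), \qquad 0 \le m \le k,$$

is symmetric about $\omega_i(t), \omega_j(t)$. Based on the structure of the power set $\mathcal{P}(a^k(t))$, it is simple to verify that $C_t^m$ ($0 \le m \le k$) is symmetric about any two $\omega_i(t), \omega_j(t) \in a^k(t)$. Therefore, $K_t(\Omega(t))$ is symmetric about $\omega_i(t), \omega_j(t) \in a^k(t)$.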
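
The symmetry claimed in the last step can also be checked numerically. The Python sketch below computes, for each subset size $m$, the sum over all size-$m$ subsets $\mathcal{E}$ of $\prod_{i \in \mathcal{E}} \omega_i \prod_{j \notin \mathcal{E}} (1-\omega_j)$ and compares the result before and after exchanging two beliefs. The function name `coefficient` and the sample beliefs are hypothetical, and the check is an illustration rather than a proof.

```python
from itertools import combinations
from math import isclose, prod


def coefficient(omega, m):
    """Sum over size-m subsets E of prod_{i in E} w_i * prod_{j not in E} (1 - w_j)."""
    idx = range(len(omega))
    total = 0.0
    for E in combinations(idx, m):
        chosen = set(E)
        total += prod(omega[i] if i in chosen else 1.0 - omega[i] for i in idx)
    return total


# Hypothetical beliefs of the k = 3 selected channels, and the same beliefs
# with the first and last entries exchanged.
omega = [0.3, 0.7, 0.5]
swapped = [0.5, 0.7, 0.3]
coeffs = [coefficient(omega, m) for m in range(len(omega) + 1)]
coeffs_swapped = [coefficient(swapped, m) for m in range(len(swapped) + 1)]
print(coeffs)
print(coeffs_swapped)
```

The two printed lists agree up to rounding, and the coefficients sum to 1 because they form the distribution of the number of selected channels found in the good state.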